This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML and PDF documents. R Markdown documents are useful for interspersing formatted text with code. There are many ways to write formatted text within markdown, and RStudio hosts a good cheatsheet (See section #3).
You can run the R code within chunks either line by line (ctrl-enter or cmd-enter) or all at once (see keyboard shortcuts). To run a single line, place your cursor on it and then press the keys.
Here is a chunk of R code that installs all of the packages that will be used during the workshop. You should uncomment all of the lines (remove the #), select all lines, and then hit ctrl-enter or cmd-enter. You can also install packages using the install button on the packages pane in RStudio.
# install.packages('ggmap')
# install.packages('tidyverse')
# install.packages('gapminder')
# install.packages('cowplot')
The ggmap package is used for plotting maps and spatial data. We will use it today to learn how to run code in R, and to play around with functions a bit.
Note: You can supply arguments to chunks to customize their behavior. This chunk has the argument cache=TRUE supplied to it. This just tells the chunk to store its results, which makes the document quicker to run if you convert it to HTML multiple times.
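library(ggmap)
## Note: recent versions of ggmap need a Google Maps API key, registered with
## register_google(key = "YOUR_KEY"), before the Google-backed calls below will work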
## Plot a map of texas -- Note it searches an online database for maps matching "texas"
qmap("texas", zoom=6, color="bw")
## Plot a map of UT now
qmap("University of Texas at Austin", zoom=15)
# You can create a variable with the "=" sign
pcl_location = geocode("101 E 21st St, Austin, TX 78712", source = "google")
## Now use a ggmap function to plot the map with the point.
## This is a function that is strung together with a "+"
ggmap(get_map("University of Texas at Austin", zoom = 15)) +
  geom_point(data = pcl_location, size = 7, shape = 13, color = "red")
Check out the environment pane on the top right of the RStudio screen. What do you notice? I tend to glance at that pane every once in a while to make sure variables are being created and changed as expected. For example, we created the pcl_location variable in the previous chunk, so we can check to make sure it’s there.
Copy one of the qmap() lines of code from the previous chunk, and paste it in the next chunk. Change the number for the zoom parameter (it must be between 3 and 21), and change the location within the quotes. Can you get a map of Africa? How about one of your hometown?
# R Code here
Create a new variable called home that has your current or past home address saved. Then plot it on a map like we did for PCL.
# R Code here
Now that you know how to use R Markdown and have learned about variables and functions, let’s play around with data frames using the tidyverse. Follow along with this during the presentation portion. The tidyverse is a system of packages created by Hadley Wickham that provide consistent and intuitive syntax for manipulating, analyzing, and visualizing data.
filter, select, and %>%
The filter and select functions allow for easy subsetting of the data (selecting specific portions). filter extracts rows fulfilling a specified expression, and select extracts columns specified by name or index (number). You can also tell select which columns you don’t want by adding a “-” in front of the column name.
We can link functions/commands together using the %>% operator. %>% takes the output of the function or variable on its left and passes it, by default, as the first argument to the function on its right. So df %>% head() is the same as head(df).
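As a small sketch of that first-argument rule (the numbers here are made up for illustration), any extra arguments you give to the right-hand function simply follow the piped value:

```r
library(dplyr)  # loading dplyr makes the %>% operator available

# The piped value becomes the first argument, so these two are equivalent
a <- round(3.14159, 2)
b <- 3.14159 %>% round(2)

# Pipes can be chained: take square roots, then add them up
total <- c(1, 4, 9) %>% sqrt() %>% sum()
```

Reading a chain left to right ("take this, then do that") is usually easier than reading the nested form sum(sqrt(c(1, 4, 9))) from the inside out.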
library(tidyverse)
library(gapminder)
# These next two lines do the same thing
gapminder %>% head()
## # A tibble: 6 × 6
## country continent year lifeExp pop gdpPercap
## <fctr> <fctr> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.801 8425333 779.4453
## 2 Afghanistan Asia 1957 30.332 9240934 820.8530
## 3 Afghanistan Asia 1962 31.997 10267083 853.1007
## 4 Afghanistan Asia 1967 34.020 11537966 836.1971
## 5 Afghanistan Asia 1972 36.088 13079460 739.9811
## 6 Afghanistan Asia 1977 38.438 14880372 786.1134
head(gapminder)
## # A tibble: 6 × 6
## country continent year lifeExp pop gdpPercap
## <fctr> <fctr> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.801 8425333 779.4453
## 2 Afghanistan Asia 1957 30.332 9240934 820.8530
## 3 Afghanistan Asia 1962 31.997 10267083 853.1007
## 4 Afghanistan Asia 1967 34.020 11537966 836.1971
## 5 Afghanistan Asia 1972 36.088 13079460 739.9811
## 6 Afghanistan Asia 1977 38.438 14880372 786.1134
# Example filtering rows
gapminder %>% filter(year==1952)
## # A tibble: 142 × 6
## country continent year lifeExp pop gdpPercap
## <fctr> <fctr> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.801 8425333 779.4453
## 2 Albania Europe 1952 55.230 1282697 1601.0561
## 3 Algeria Africa 1952 43.077 9279525 2449.0082
## 4 Angola Africa 1952 30.015 4232095 3520.6103
## 5 Argentina Americas 1952 62.485 17876956 5911.3151
## 6 Australia Oceania 1952 69.120 8691212 10039.5956
## 7 Austria Europe 1952 66.800 6927772 6137.0765
## 8 Bahrain Asia 1952 50.939 120447 9867.0848
## 9 Bangladesh Asia 1952 37.484 46886859 684.2442
## 10 Belgium Europe 1952 68.000 8730405 8343.1051
## # ... with 132 more rows
# These next two lines result in the same data frame
gapminder %>% select(country:year)
## # A tibble: 1,704 × 3
## country continent year
## <fctr> <fctr> <int>
## 1 Afghanistan Asia 1952
## 2 Afghanistan Asia 1957
## 3 Afghanistan Asia 1962
## 4 Afghanistan Asia 1967
## 5 Afghanistan Asia 1972
## 6 Afghanistan Asia 1977
## 7 Afghanistan Asia 1982
## 8 Afghanistan Asia 1987
## 9 Afghanistan Asia 1992
## 10 Afghanistan Asia 1997
## # ... with 1,694 more rows
gapminder %>% select(country, continent, year)
## # A tibble: 1,704 × 3
## country continent year
## <fctr> <fctr> <int>
## 1 Afghanistan Asia 1952
## 2 Afghanistan Asia 1957
## 3 Afghanistan Asia 1962
## 4 Afghanistan Asia 1967
## 5 Afghanistan Asia 1972
## 6 Afghanistan Asia 1977
## 7 Afghanistan Asia 1982
## 8 Afghanistan Asia 1987
## 9 Afghanistan Asia 1992
## 10 Afghanistan Asia 1997
## # ... with 1,694 more rows
# We can link multiple statements together
# So if we want only the population data for year 1952 we could do this:
gapminder %>% filter(year == 1952) %>%
  select(country, year, pop)
## # A tibble: 142 × 3
## country year pop
## <fctr> <int> <int>
## 1 Afghanistan 1952 8425333
## 2 Albania 1952 1282697
## 3 Algeria 1952 9279525
## 4 Angola 1952 4232095
## 5 Argentina 1952 17876956
## 6 Australia 1952 8691212
## 7 Austria 1952 6927772
## 8 Bahrain 1952 120447
## 9 Bangladesh 1952 46886859
## 10 Belgium 1952 8730405
## # ... with 132 more rows
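The chunk above selects columns by name only. As a rough sketch of the other two forms mentioned earlier (dropping a column with “-” and selecting by index), here is a toy data frame that is not part of the workshop data:

```r
library(dplyr)

# Hypothetical toy data, invented purely for illustration
toy <- data.frame(
  country = c("A", "B", "C"),
  year    = c(1952L, 1952L, 1957L),
  pop     = c(10, 20, 30)
)

# Drop a column by putting "-" in front of its name
no_pop <- toy %>% select(-pop)

# Select columns by position (index)
first_two <- toy %>% select(1:2)

# Both approaches keep the same two columns here
names(no_pop)
names(first_two)
```

Selecting by name is generally safer than selecting by index, since an index silently points at a different column if the column order ever changes.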
Try to use filter() and select() to subset your data to include only the country, year, and life expectancy data from Belgium.
# R Code here
mutate, group_by, and summarise
mutate is a function that adds columns to data frames. Oftentimes we make new columns in data frames out of combinations of old columns, and mutate makes this fairly straightforward.
group_by and summarise are commonly used in tandem. Oftentimes we want to summarise data for specific groups within the data. For example, if we had data for the heights of all people on campus, we might want to know the mean height for each gender. We would first group our data by gender, and then summarise the data with the mean.
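Here is a minimal sketch of that heights example. The data frame and its values are hypothetical, made up only to show the group-then-summarise pattern:

```r
library(dplyr)

# Hypothetical heights in cm; not real measurements
heights <- data.frame(
  gender = c("F", "F", "M", "M"),
  height = c(160, 170, 175, 185)
)

# Group the rows by gender, then collapse each group to its mean height
mean_heights <- heights %>%
  group_by(gender) %>%
  summarise(mean_height = mean(height))

mean_heights
```

The result has one row per group, so four input rows become two summary rows.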
# Add a gdp column using the per capita gdp and total population
gapminder %>% mutate(gdp = gdpPercap * pop)
## # A tibble: 1,704 × 7
## country continent year lifeExp pop gdpPercap gdp
## <fctr> <fctr> <int> <dbl> <int> <dbl> <dbl>
## 1 Afghanistan Asia 1952 28.801 8425333 779.4453 6567086330
## 2 Afghanistan Asia 1957 30.332 9240934 820.8530 7585448670
## 3 Afghanistan Asia 1962 31.997 10267083 853.1007 8758855797
## 4 Afghanistan Asia 1967 34.020 11537966 836.1971 9648014150
## 5 Afghanistan Asia 1972 36.088 13079460 739.9811 9678553274
## 6 Afghanistan Asia 1977 38.438 14880372 786.1134 11697659231
## 7 Afghanistan Asia 1982 39.854 12881816 978.0114 12598563401
## 8 Afghanistan Asia 1987 40.822 13867957 852.3959 11820990309
## 9 Afghanistan Asia 1992 41.674 16317921 649.3414 10595901589
## 10 Afghanistan Asia 1997 41.763 22227415 635.3414 14121995875
## # ... with 1,694 more rows
# Notice how the group component is added on
gapminder %>% group_by(year)
## Source: local data frame [1,704 x 6]
## Groups: year [12]
##
## country continent year lifeExp pop gdpPercap
## <fctr> <fctr> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.801 8425333 779.4453
## 2 Afghanistan Asia 1957 30.332 9240934 820.8530
## 3 Afghanistan Asia 1962 31.997 10267083 853.1007
## 4 Afghanistan Asia 1967 34.020 11537966 836.1971
## 5 Afghanistan Asia 1972 36.088 13079460 739.9811
## 6 Afghanistan Asia 1977 38.438 14880372 786.1134
## 7 Afghanistan Asia 1982 39.854 12881816 978.0114
## 8 Afghanistan Asia 1987 40.822 13867957 852.3959
## 9 Afghanistan Asia 1992 41.674 16317921 649.3414
## 10 Afghanistan Asia 1997 41.763 22227415 635.3414
## # ... with 1,694 more rows
# Once things are grouped, we can summarize multiple rows like this
gapminder %>% group_by(year) %>%
  summarise(mean_pc_gdp = mean(gdpPercap))
## # A tibble: 12 × 2
## year mean_pc_gdp
## <int> <dbl>
## 1 1952 3725.276
## 2 1957 4299.408
## 3 1962 4725.812
## 4 1967 5483.653
## 5 1972 6770.083
## 6 1977 7313.166
## 7 1982 7518.902
## 8 1987 7900.920
## 9 1992 8158.609
## 10 1997 9090.175
## 11 2002 9917.848
## 12 2007 11680.072
# This is how we would save the resultant data frame, and we could
# do this for any of the previous chunk statements.
avg_gdp_by_year = gapminder %>% group_by(year) %>%
  summarise(mean_pc_gdp = mean(gdpPercap))
Try to use mutate(), group_by(), and summarise() to add a gdp column to gapminder, and then find the average gdp for each country.
# R Code here
Now we’re going to discuss visualizing data using the ggplot2 package. In doing so, we will also discover why tidy data is easy to work with, and therefore will learn about data reshaping/manipulation using tidyr.
The ggplot2 package revolves around the ggplot function. We first specify the data, then the aesthetics, and finally the type of plot we would like to make. Further customization is possible, but we won’t have time to talk much about it. The ggplot2 documentation is a really helpful reference for understanding how to customize plots.
We will work with the pew dataset that is part of the tidyr vignette, but we need to download it first. The data contain a number of religious affiliations, and the frequency of followers falling into a variety of income brackets.
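To make the "data, then aesthetics, then plot type" order concrete, here is a minimal sketch. It assumes R's built-in mtcars data set, which is not otherwise used in this workshop:

```r
library(ggplot2)

# 1. data: the data frame to plot (mtcars ships with R)
# 2. aesthetics: which columns map to the x and y axes
# 3. plot type: a geom layer added with "+"
p <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Printing the object draws the plot
print(p)
```

Swapping geom_point() for another geom changes the plot type while the data and aesthetics stay the same.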
library(cowplot) # I prefer cowplot to ggplot default themes
pew = read_csv("https://raw.githubusercontent.com/hadley/tidyr/master/vignettes/pew.csv")
# Plot a scatterplot of the number of individuals in the <$10k bracket versus the >150k bracket
pew %>% ggplot(aes(x = `<$10k`, y = `>150k`)) +
  geom_point()
# Plot bar plot for <10k bracket for all religions
# The x-axis labels overlap, but we can customize those
pew %>% ggplot(aes(x = religion, y = `<$10k`)) +
  geom_bar(stat = "identity")
But what if we wanted to plot multiple income brackets? We would need to specify each column individually, and then somehow manually arrange them on the figure. This is where the tidyr package comes into play, which we will learn about next.
Use the gapminder dataset and try to create a boxplot (geom_boxplot()) for the life expectancy of each continent.
# R Code here
The tidyr package contains many useful functions for cleaning and reshaping your data, but we will mainly talk about two of them, spread and gather. spread converts long data to wide, and gather converts wide data to long. Since it’s more common for data to start out in wide format, we will primarily focus on the gather function. Tidy data are defined by the following two characteristics: each variable forms a column, and each observation forms a row.
What form are our pew data in?
pew %>% head()
## # A tibble: 6 × 11
## religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k`
## <chr> <int> <int> <int> <int> <int>
## 1 Agnostic 27 34 60 81 76
## 2 Atheist 12 27 37 52 35
## 3 Buddhist 27 21 30 34 33
## 4 Catholic 418 617 732 670 638
## 5 Don’t know/refused 15 14 15 11 10
## 6 Evangelical Prot 575 869 1064 982 881
## # ... with 5 more variables: `$50-75k` <int>, `$75-100k` <int>,
## # `$100-150k` <int>, `>150k` <int>, `Don't know/refused` <int>
gather
The pew data are in what we call “wide” data format, where each row corresponds to a class and the columns correspond to characteristics (observations) of that class. If we want to plot a bar for each observation, we need to convert our “wide” data to “long” format using the gather function from tidyr.
The format of gather is as follows: gather(key = observation_name, value = data_name, columns_being_gathered), where observation_name is what you want to call the column that stores the observation type, data_name is what you want to call the column that stores the specific data values, and columns_being_gathered is the set of columns (using the same syntax as select) that you want gathered. spread has similar syntax and can be used to reverse gathering.
# We want all columns gathered except for the religion column:
pew %>% gather(key=income, value=frequency, -religion)
## # A tibble: 180 × 3
## religion income frequency
## <chr> <chr> <int>
## 1 Agnostic <$10k 27
## 2 Atheist <$10k 12
## 3 Buddhist <$10k 27
## 4 Catholic <$10k 418
## 5 Don’t know/refused <$10k 15
## 6 Evangelical Prot <$10k 575
## 7 Hindu <$10k 1
## 8 Historically Black Prot <$10k 228
## 9 Jehovah's Witness <$10k 20
## 10 Jewish <$10k 19
## # ... with 170 more rows
# Alternatively:
pew %>% gather(key=income, value=frequency, 2:11)
## # A tibble: 180 × 3
## religion income frequency
## <chr> <chr> <int>
## 1 Agnostic <$10k 27
## 2 Atheist <$10k 12
## 3 Buddhist <$10k 27
## 4 Catholic <$10k 418
## 5 Don’t know/refused <$10k 15
## 6 Evangelical Prot <$10k 575
## 7 Hindu <$10k 1
## 8 Historically Black Prot <$10k 228
## 9 Jehovah's Witness <$10k 20
## 10 Jewish <$10k 19
## # ... with 170 more rows
# Use spread to reverse gathering
pew %>% gather(key=income, value=frequency, 2:11) %>%
  spread(income, frequency)
## # A tibble: 18 × 11
## religion `<$10k` `>150k` `$10-20k` `$100-150k` `$20-30k`
## * <chr> <int> <int> <int> <int> <int>
## 1 Agnostic 27 84 34 109 60
## 2 Atheist 12 74 27 59 37
## 3 Buddhist 27 53 21 39 30
## 4 Catholic 418 633 617 792 732
## 5 Don’t know/refused 15 18 14 17 15
## 6 Evangelical Prot 575 414 869 723 1064
## 7 Hindu 1 54 9 48 7
## 8 Historically Black Prot 228 78 244 81 236
## 9 Jehovah's Witness 20 6 27 11 24
## 10 Jewish 19 151 19 87 25
## 11 Mainline Prot 289 634 495 753 619
## 12 Mormon 29 42 40 49 48
## 13 Muslim 6 6 7 8 9
## 14 Orthodox 13 46 17 42 23
## 15 Other Christian 9 12 7 14 11
## 16 Other Faiths 20 41 33 40 40
## 17 Other World Religions 5 4 2 4 3
## 18 Unaffiliated 217 258 299 321 374
## # ... with 5 more variables: `$30-40k` <int>, `$40-50k` <int>,
## # `$50-75k` <int>, `$75-100k` <int>, `Don't know/refused` <int>
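A side note for newer tidyr installs: gather and spread still work, but tidyr 1.0.0 and later also offer pivot_longer() and pivot_wider() for the same reshaping. Here is a rough sketch of the same round trip on a made-up two-row table (the religion labels, column names, and counts are hypothetical):

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table: one row per group, one column per income bracket
wide <- data.frame(
  religion = c("A", "B"),
  low      = c(1, 2),
  high     = c(3, 4)
)

# Wide to long: the counterpart of gather(key = income, value = frequency, -religion)
long <- wide %>%
  pivot_longer(cols = -religion, names_to = "income", values_to = "frequency")

# Long to wide: the counterpart of spread(income, frequency)
back <- long %>%
  pivot_wider(names_from = income, values_from = frequency)
```

The argument names differ from gather, but the idea is identical: name the new key column, name the new value column, and say which columns to collapse.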
Let’s say you want to drop the pew data for people who fall in the Don't know/refused column. Gather the income frequencies for everyone who answered the question, leaving out the Don't know/refused column.
Does the order in which you do things matter?
# R Code here
We now have gone through all of the basics for data manipulation, analysis, and plotting. The cool thing about the tidyverse is that you can link as many of these pipes together as you’d like. I’d say you probably want to limit to a reasonable number for readability, but we can explore the pew data much more fully now. What if we wanted to plot the bar chart we had before, but with bars for all of the income brackets colored in?
pew %>% gather(income, frequency, -religion) %>%
  ggplot(aes(x = religion, y = frequency, fill = income)) +
  geom_bar(stat = "identity")
Customizing specific theme elements takes a bit more work, but can usually be solved through a quick google search. For example, we can fix the x axis labels this way:
pew %>% gather(income, frequency, -religion) %>%
  ggplot(aes(x = religion, y = frequency, fill = income)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))
Let’s go back to the gapminder dataset, as it is slightly more interesting for more complex examples.
# We can plot lines for the life expectancy over time for each country
gapminder %>%
  ggplot(aes(year, lifeExp, group = country)) +
  geom_line()
# We can add an overall trendline
gapminder %>%
  ggplot(aes(year, lifeExp)) +
  geom_point() +
  stat_smooth()
What if we wanted to plot the mean life expectancy for each year across all of the countries? Try using everything you’ve learned to make that plot in a few concise lines!
# R Code here
Now try plotting the mean life expectancy for just the continent of Africa. You should be able to do it by just adding one line to the chunk above!
# R Code here
If you’re up for a bit of a challenge, try to make a scatterplot of the mean life expectancy versus the mean per capita gdp for each country and year. Hint: you can summarise two things at once to create two summary columns, and you can also group_by multiple columns.
# R Code Here